An analysis of almost any social media dataset can be rather telling of how subgroups of a population interact with one another on a large scale. We are interested in the content of these interactions and how they vary across the United States over the few days that our data spans.
Ever wanted to know what everyone’s been tweeting about? Well, thanks to Twitter’s hashtag system, that’s already possible. But how about the most popular places everyone’s been tweeting from? Or a simplified way to see how all those Twitter users are feeling? Thanks to some in-depth exploratory analyses from Dr. Jeff Goldsmith’s (Columbia University’s Mailman School of Public Health) Team Awesome™, and courtesy of Followthehashtag’s publicly available Twitter data, even this is possible.
In a rapidly changing and increasingly tech-based world, people now have the power to react to global events happening thousands of miles away essentially in real time. Social media as a whole, and Twitter especially, is one of the biggest domains for capturing these reactions. Our team’s motivation for this analysis comes from a desire to aggregate these reactions in as compact and sensible a format as possible.
What questions are you trying to answer? How did these questions evolve over the course of the project? What new questions did you consider in the course of your analysis?
Our initial ideas for analysis were:
Identify different languages
Which accounts get the most traffic
Geolocation traffic
Any events that happened during the time period, measuring their influence
Percent of tweets that use emojis or hashtags
Positive/negative words, and their correlation with time
Percent of tweets devoted to a given topic: political, food, hobbies/lifestyle, etc.
Correlation between location and sentiment
Advertising percentage, success
Content analysis: top tweets & retweets
Correlation of tweets with events on the days covered
As we looked at the dataset and analyzed the variables available to us, we found that some of these questions were outside of the scope of the information or tools available to us.
We decided to focus on just a few areas:
Positive/negative words
Correlation with time
Correlation between location and sentiment
The dataset we used from Followthehashtag is a large but incomplete collection of 200,000 tweets from users across the United States (it also includes tweets from outside the U.S., but we focused on domestic ones), spanning April 14, 2016 to April 16, 2016. It comes as an easy-to-access CSV file within a zipped folder. For each tweet, user information such as name, location (latitude/longitude), and number of followers is given, along with the full content of the tweet itself.
We chose this dataset mainly because it was already in a tidy form and needed minimal cleaning. It included the variables we were interested in, such as tweet content for sentiment analysis and location data for mapping.
library(readr)
# The sentiment function takes a long time to run, so we saved its output
# to us_tweets.csv so you don't have to re-run it
us_tweets <- read_csv("us_tweets.csv")
# Strip non-alphabetic characters (punctuation, digits, symbols)
us_tweets$tweet_content_stripped <- gsub("[^[:alpha:] ]", "", us_tweets$tweet_content)
# Remove all words that are 1-2 letters long
us_tweets$tweet_content_stripped <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", us_tweets$tweet_content_stripped)
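As a sanity check, here is what those two substitutions do to a hypothetical tweet (the example string is made up for illustration):

```r
# Hypothetical example tweet illustrating the two cleaning steps above
tweet <- "Loving NYC in April!! #spring @user http://t.co/abc 123"
step1 <- gsub("[^[:alpha:] ]", "", tweet)                # keep only letters/spaces
step2 <- gsub(" *\\b[[:alpha:]]{1,2}\\b *", " ", step1)  # drop 1-2 letter words
step2
# -> "Loving NYC April spring user httptcoabc "
```

Note that URLs and numbers collapse into letter runs or disappear entirely, which is fine for word-level sentiment scoring.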
We used the Syuzhet package from GitHub (thank you Matthew Jockers!) to extract sentiments from tweet content. Our primary analyses consisted of mapping these tweets (using tweet location) as observable sentiments across the United States, which gives a nice aggregate picture of how the U.S. twitterverse was feeling during the dates mentioned above.
About the Sentiment function: Matthew Jockers’s sentiment function is essentially a dictionary that assigns different words to different sentiments. The general sentiments he uses in this function (and subsequently the ones we use in our analyses) are trust, joy, anger, sadness, fear, disgust, anticipation, and surprise. While some of these sentiments may not seem intuitive to use, altogether they form a relatively broad spectrum of moods and emotions which make for interesting analyses.
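As a minimal sketch of the scoring step (assuming the syuzhet package is installed): `get_nrc_sentiment()` returns one row per input string, with counts for the eight NRC emotions plus overall negative and positive scores. The strings below are illustrative, not tweets from our dataset:

```r
library(syuzhet)

# Illustrative input strings; each row of the result holds counts for
# anger, anticipation, disgust, fear, joy, sadness, surprise, trust,
# negative, and positive
sample_text <- c("I love my wonderful new job", "This traffic is terrible")
get_nrc_sentiment(sample_text)
```

These ten columns are what we later bind onto `us_tweets` and sum in our analyses.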
Visualizations, summaries, and exploratory statistical analyses. Justify the steps you took, and show any major changes to your ideas.
Our additional Shiny repos can be found: here for the entire U.S. and here for individual states.
What were your findings? Are they what you expect? What insights into the data can you make?
# Columns 20:27 of us_tweets hold the eight NRC emotion counts
sentimentTotals <- data.frame(colSums(us_tweets[, 20:27]))
names(sentimentTotals) <- "count"
sentimentTotals <- cbind("sentiment" = rownames(sentimentTotals),
sentimentTotals)
sentimentTotals
## sentiment count
## anger anger 13605
## anticipation anticipation 52960
## disgust disgust 12668
## fear fear 19942
## joy joy 46690
## sadness sadness 21882
## surprise surprise 22067
## trust trust 76347
us_tweets_long <- gather(us_tweets, sentiment, count, anger:trust,
factor_key = TRUE)
us_tweets$hour <- as.POSIXct(us_tweets$hour, format = " %H:%M")
ggplot(data = us_tweets, aes(x = hour)) +
geom_bar() +
xlab("Time") + ylab("Number of tweets") +
ggtitle("Number of Tweets per Hour") +
scale_x_datetime(labels = date_format("%H:%M"))
From this graph, we noticed that some time intervals are missing from our dataset. We are not sure why; the website from which we obtained the data presumably did not scrape tweets during those periods.
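The missing intervals can be listed by comparing the observed hours against a full 24-hour day. This is a sketch assuming `us_tweets$hour` has been parsed with `as.POSIXct()` as above:

```r
# Hours of the day (00-23) with no tweets in the sample
observed_hours <- unique(format(us_tweets$hour, "%H"))
setdiff(sprintf("%02d", 0:23), observed_hours)
```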
us_tweets$charsintweet <- nchar(us_tweets$tweet_content)  # nchar() is vectorized
ggplot(data = us_tweets, aes(x = charsintweet)) +
geom_histogram(aes(fill = ..count..), binwidth = 8) +
theme(legend.position = "none") +
xlab("Characters per Tweet") +
ylab("Number of tweets") +
scale_fill_gradient(low = "midnightblue", high = "aquamarine4") +
xlim(0,150) +
ggtitle("Characters per Tweet")
ggplot(data = sentimentTotals, aes(x = sentiment, y = count)) +
geom_bar(aes(fill = sentiment), stat = "identity") +
theme(legend.position = "none") +
xlab("Sentiment") +
ylab("Total Count") +
ggtitle("Total Sentiment Score for All Tweets in Sample")
tweet_words <- us_tweets %>%
unnest_tokens(word, tweet_content_stripped)
data(stop_words)
tweet_words <-
anti_join(tweet_words, stop_words, by = "word")
tweet_words %>%
count(word) %>%
with(wordcloud(word, n, max.words = 200,
random.order = FALSE,
rot.per = 0.35,
# brewer.pal() requires at least 3 colors
colors = brewer.pal(8, "Dark2")))
tweet_words %>%
count(word, sort = TRUE) %>%
top_n(10) %>%
mutate(word = fct_reorder(word, n)) %>%
ggplot(aes(x = word, y = n)) +
geom_bar(stat = "identity", fill = "blue", alpha = .6) +
coord_flip()
# Extract every hashtag, then strip punctuation and lowercase for counting
hashtags <- str_extract_all(us_tweets$tweet_content, "#\\S+")
hashtags <- unlist(hashtags)
hashtags <- gsub("[^[:alnum:] ]", "", hashtags)
hashtags <- tolower(hashtags)
hashtag.df <- data.frame(table(hashtags))
hashtag.df$hashtags <- as.character(hashtag.df$hashtags)
hashtag.df$Freq <- as.numeric(as.character(hashtag.df$Freq))
hashtag.df <- arrange(hashtag.df, desc(Freq))
print(hashtag.df[1:20,])
## hashtags Freq
## 1 job 51511
## 2 hiring 45428
## 3 jobs 21910
## 4 careerarc 20717
## 5 retail 7454
## 6 hospitality 7311
## 7 nursing 5091
## 8 healthcare 4702
## 9 veterans 4471
## 10 sales 3310
## 11 it 2179
## 12 customerservice 1927
## 13 transportation 1568
## 14 sonic 1520
## 15 manufacturing 1476
## 16 photo 1432
## 17 businessmgmt 1348
## 18 accounting 1053
## 19 engineering 970
## 20 traffic 955
When mapping the positive scores for all tweets, we see moderate-to-low scores throughout the U.S. At this scale, we cannot see a definitive trend at the state level. However, we do see that relatively few tweets are generated in the Midwest or Northwest, and there do seem to be slightly more positive tweets from the middle of the country.
#positive tweets, ggplot
us_tweets %>%
filter(country == "US") %>%
ggplot(aes(x = longitude, y = latitude, color = positive)) +
geom_point(alpha = .6) +
scale_colour_gradientn(colours = rainbow(10)) +
ggtitle("Positive Tweets")
When mapping sentiment across the U.S., we see an overwhelming number of “trust” tweets. We are not quite sure what drives this emotion, but we found that most tweets including “job” or “jobs” mapped to “trust.” Because there are so many tweets with those words, it is interesting to filter that emotion out.
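A rough check of that claim, sketched against our scored data (column names as in our dataset):

```r
# Share of tweets mentioning "job"/"jobs" whose trust score is positive
job_idx <- grepl("\\bjobs?\\b", tolower(us_tweets$tweet_content))
mean(us_tweets$trust[job_idx] > 0)
```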
#name of sentiment, ggplot
us_tweets_long %>%
filter(country == "US") %>%
filter(count > 0) %>%
ggplot(aes(x = longitude, y = latitude, color = factor(sentiment))) +
geom_point(alpha = .6)+
ggtitle("Tweet Sentiments") +
scale_color_discrete(name="Sentiment")
When we filter out trust, we see that surprise and joy seem to be commonly tweeted emotions.
#name of sentiment, ggplot
us_tweets_long %>%
filter(country == "US") %>%
filter(count > 0) %>%
filter(sentiment != "trust") %>%
ggplot(aes(x = longitude, y = latitude, color = factor(sentiment))) +
geom_point(alpha = .6)+
ggtitle("Tweet Sentiments") +
scale_color_discrete(name="Sentiment")
Because our location column varies in specificity, we built a function that takes the latitude and longitude of each tweet and converts it to the state in which the tweet originated. We then added that state name to our original dataset.
library(maps)      # map()
library(maptools)  # map2SpatialPolygons()
library(sp)        # SpatialPoints(), over(), CRS()
state_tweets = us_tweets %>%
select("longitude", "latitude")
# Convert (longitude, latitude) pairs to the lower-48 state containing them;
# points that fall outside every state polygon return NA
latlong2state <- function(state_tweets) {
states <- map('state', fill = TRUE, col = "transparent", plot = FALSE)
IDs <- sapply(strsplit(states$names, ":"), function(x) x[1])
states_sp <- map2SpatialPolygons(states, IDs = IDs,
proj4string = CRS("+proj=longlat +datum=WGS84"))
states_tweets_SP <- SpatialPoints(state_tweets,
proj4string = CRS("+proj=longlat +datum=WGS84"))
indices <- over(states_tweets_SP, states_sp)
stateNames <- sapply(states_sp@polygons, function(x) x@ID)
stateNames[indices]
}
state_name = latlong2state(state_tweets)
us_tweets = cbind(state_name, us_tweets)
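A quick sanity check on the conversion, with illustrative coordinates (this assumes the maps/maptools/sp stack used by the function is loaded):

```r
# A point in Manhattan should map to "new york"; a point in the
# Atlantic Ocean falls outside every state polygon and returns NA
latlong2state(data.frame(longitude = c(-73.99, -40),
                         latitude  = c(40.76,  30)))
```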
To evaluate overall sentiment by state, we selected the appropriate columns, then grouped and summed by state, discarding tweets with missing locations. Maine, Alaska, and Hawaii were not included in this dataset; the 48-state count comes from Virginia and the District of Columbia receiving individual designations.
us_sentiments = us_tweets %>%
filter(country == "US") %>%
# column 1 is state_name; columns 21:30 hold the ten sentiment scores
select(c(1, 21:30)) %>%
filter(!is.na(state_name)) %>%
group_by(state_name) %>%
summarise_all(funs(sum)) %>%
mutate(positive = as.numeric(positive),
negative = as.numeric(negative))
The following choropleth maps show the levels of positive and negative sentiment across the United States during the 48-hour period of our dataset. Maine, Alaska, and Hawaii are blacked out because tweets from those states were not recorded.
We can observe from these two maps that states like California and Texas are consistently the highest ranked, which is likely population related. Interestingly, the state with the lowest positive and negative sentiment scores is Washington. This could be for two reasons: a population difference, or Washington’s tweets containing fewer sentiment-bearing words than other states’ and therefore generating weaker sentiment scores.
us_sentiments %>%
select("state_name", "negative") %>%
rename(region = state_name, value = negative) %>%
state_choropleth(title = "Negative Sentiment across the U.S.",
legend = "Sentiment Score")
us_sentiments %>%
select("state_name", "positive") %>%
rename(region = state_name, value = positive) %>%
state_choropleth(title = "Positive Sentiment Across the U.S.",
legend = "Sentiment Score")
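Since raw totals largely track tweet volume, one way to separate the two explanations above would be to map sentiment per tweet instead of summed scores. This is a sketch, assuming `state_name` has been attached to `us_tweets` as above:

```r
# Tweets per state, used to normalize the summed positive scores
state_counts <- us_tweets %>%
  filter(country == "US", !is.na(state_name)) %>%
  count(state_name)

us_sentiments %>%
  left_join(state_counts, by = "state_name") %>%
  mutate(value = positive / n) %>%
  select(region = state_name, value) %>%
  state_choropleth(title = "Positive Sentiment per Tweet",
                   legend = "Mean Score")
```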